need help with regular expression

on 14.06.2005 23:17:50 by Praedor Atrebates

I am dinking around with perl for bioinformatics purposes. I have written a
small perl script that reads FASTA formatted sequence files and searches the
sequence therein for user-entered sequences.

This is primarily targeted at protein sequence analysis and for my purposes,
I've included - or am trying to include - one hard-coded sequence string.

I have this (pertinent) code in my script:

$dnakmotif = '[KRH][L{3,}V{3,}I{3,}F{3,}Y{3,}A{3,}][KRH]';

print "Enter the motif/amino acid sequence pattern to search for:\n";
$motif = <STDIN>;
chomp $motif;

if ($motif =~ 'dnak') {
$motif = $dnakmotif;
}

Thus, if I enter the string "dnak" I want my query to be set to the value of
$dnakmotif.

I am just learning as I go here but there are two problems with this, one of
which I understand but the other I do not. First, the one I do not
understand. My intent with the value set to $dnakmotif was to search for K
or R or H in a sequence followed by a string of 3 or more of any of the
contents of the second bracket pair (L V I F Y A). When I run the program
and run a search for "dnak" I get a string of hits in my test sequence, but they
don't match what I am after. I get a series of hits, for instance, of a K
followed by ONE A or ONE V followed by an H instead of at LEAST 3 of any of
L or V or I, etc. Why doesn't this work?

The next problem is one I understand but have no idea how to correct. The
value I set to $dnakmotif is too restrictive for the actual searches I need
to do. What I want is to search for a sequence/character string with any of
K or R or H on either end, but _between them_ any combination of L, V, I, F,
A, or Y is OK, in repeats or all individually so long as the minimum number
is 3 and the max number (any combination of the characters) is _no more_ than
5. How do I make this character search be much less restrictive than I've
started out with?

Thank you for any aid,
praedor
--
"Voice or no voice, the people can always be brought to the bidding of
the leaders. That is easy. All you have to do is tell them they are
being attacked, and denounce the peacemakers for lack of patriotism and
exposing the country to danger. It works the same in any country."
--Hermann Goering

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org

Re: need help with regular expression

on 15.06.2005 00:33:14 by krahnj

Praedor Atrebates wrote:
> I am dinking around with perl for bioinformatics purposes. I have written a
> small perl script that reads FASTA formatted sequence files and searches the
> sequence therein for user-entered sequences.
>
> This is primarily targeted at protein sequence analysis and for my purposes,
> I've included - or am trying to include - one hard-coded sequence string.
>
> I have this (pertinent) code in my script:
>
> $dnakmotif = '[KRH][L{3,}V{3,}I{3,}F{3,}Y{3,}A{3,}][KRH]';
>
> print "Enter the motif/amino acid sequence pattern to search for:\n";
> $motif = <STDIN>;
> chomp $motif;
>
> if ($motif =~ 'dnak') {
> $motif = $dnakmotif;
> }
>
> Thus, if I enter the string "dnak" I want my query to be set to the value of
> $dnakmotif.

It sounds like you could use a hash:

my %sequences = (
    dnak => qr/pattern1/,
    dank => qr/pattern2/,
    nakd => qr/pattern3/,
);

if ( exists $sequences{ $motif } ) {
    # do something with pattern in $sequences{ $motif }
}
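
As a rough sketch of how that could look once filled in (the motif_for
helper and the fallback to the user's own input are assumptions for
illustration, not something from your script; only the "dnak" name comes
from your post):

```perl
use strict;
use warnings;

# Table of named motifs. Only "dnak" appears in this thread; add more
# names here as needed.
my %sequences = (
    dnak => qr/[KRH][LVIFYA]{3,5}[KRH]/,
);

# Hypothetical helper: return the stored pattern for a known name,
# otherwise compile whatever the user typed as a regular expression.
sub motif_for {
    my ($input) = @_;
    return exists $sequences{$input} ? $sequences{$input} : qr/$input/;
}

my $rx = motif_for('dnak');
print "hit\n" if 'MKLVIAH' =~ $rx;
```

Note that the hash lookup only fires when the input is exactly "dnak",
whereas $motif =~ 'dnak' is true for any input that merely contains
"dnak". Also, qr/$input/ will die if the user types an invalid pattern,
so real code would probably want to wrap that in an eval.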

> I am just learning as I go here but there are two problems with this, one of
> which I understand but the other I do not. First, the one I do not
> understand. My intent with the value set to $dnakmotif was to search for K
> or R or H in a sequence followed by a string of 3 or more of any of the
> contents of the second bracket pair (L V I F Y A). When I run the program
> and run a search for "dnak" I get a string of hits in my test sequence, but they
> don't match what I am after. I get a series of hits, for instance, of a K
> followed by ONE A or ONE V followed by an H instead of at LEAST 3 of any of
> L or V or I, etc. Why doesn't this work?

Anything inside the [] brackets is a character class so
[L{3,}V{3,}I{3,}F{3,}Y{3,}A{3,}] says match ONE of either 'L', 'V', 'I', 'F',
'Y', 'A', '{', '}', '3' or ',' and since duplicates are ignored it could be
written as [AFILVY{}3,].
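
A quick way to convince yourself of this is to run both forms against a
few strings. The test strings below are made up for illustration and are
not real protein data:

```perl
use strict;
use warnings;

# The class from the original post, and the reduced form given above.
my $original = qr/[KRH][L{3,}V{3,}I{3,}F{3,}Y{3,}A{3,}][KRH]/;
my $reduced  = qr/[KRH][AFILVY{}3,][KRH]/;

# Made-up test strings: one residue, a digit, a brace, and a run of three.
for my $seq (qw(KAH K3H K{R KAAAH)) {
    printf "%-6s original=%-3s reduced=%s\n",
        $seq,
        ($seq =~ $original ? 'yes' : 'no'),
        ($seq =~ $reduced  ? 'yes' : 'no');
}
```

Both patterns should agree on every line: the single-character cases
match (even the '3' and the '{'), while KAAAH does not, because the
class only ever consumes one character between the two flanking
residues.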

> The next problem is one I understand but have no idea how to correct. The
> value I set to $dnakmotif is too restrictive for the actual searches I need
> to do. What I want is to search for a sequence/character string with any of
> K or R or H on either end, but _between them_ any combination of L, V, I, F,
> A, or Y is OK, in repeats or all individually so long as the minimum number
> is 3 and the max number (any combination of the characters) is _no more_ than
> 5. How do I make this character search be much less restrictive than I've
> started out with?

It SOUNDS like you want: /[KRH][LVIFYA]{3,5}[KRH]/
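
To check the 3-to-5 limits, here is a small sketch that tries interiors
of 2, 3, 5 and 6 residues. The is_hit helper and the test strings are
made up for illustration:

```perl
use strict;
use warnings;

my $rx = qr/[KRH][LVIFYA]{3,5}[KRH]/;

# Hypothetical helper: true if the motif occurs anywhere in the sequence.
sub is_hit {
    my ($seq) = @_;
    return $seq =~ $rx ? 1 : 0;
}

# Interiors of length 2, 3, 5 and 6, each flanked by K/R/H.
for my $seq (qw(KLVH KLVIH KLLLLLR KLVIFYAH)) {
    printf "%-9s %s\n", $seq, is_hit($seq) ? 'match' : 'no match';
}
```

Only the 3- and 5-residue interiors should match. The 6-residue one
fails because after at most five interior characters the next character
has to be K, R or H.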



John
--
use Perl;
program
fulfillment

Re: need help with regular expression

on 15.06.2005 00:34:22 by japhy

On Jun 14, Praedor Atrebates said:

> $dnakmotif = '[KRH][L{3,}V{3,}I{3,}F{3,}Y{3,}A{3,}][KRH]';

> I am just learning as I go here but there are two problems with this, one of
> which I understand but the other I do not. First, the one I do not
> understand. My intent with the value set to $dnakmotif was to search for K
> or R or H in a sequence followed by a string of 3 or more of any of the
> contents of the second bracket pair (L V I F Y A). When I run the program
> and run a search for "dnak" I get a string of hits in my test sequence, but they
> don't match what I am after. I get a series of hits, for instance, of a K
> followed by ONE A or ONE V followed by an H instead of at LEAST 3 of any of
> L or V or I, etc. Why doesn't this work?

The [...] construct is a character class -- it represents a set of
characters, any of which can match. Thus, [KRH] matches a 'K', an 'R', or
an 'H'. But [A{3,}B{3,}] is really just the same as [AB3,{}] -- that is,
an 'A', a 'B', a '3', a ',', a '{', or a '}'. What you want is

$dnakmotif = qr/[KRH](?:L{3,}|V{3,}|I{3,}|F{3,}|Y{3,}|A{3,})[KRH]/;

That sounds like it should match what you're looking for.
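
Here is a small sketch of what that alternation does and does not
accept. The test strings are invented for illustration:

```perl
use strict;
use warnings;

# Three or more of the SAME residue between the flanking K/R/H.
my $same_residue = qr/[KRH](?:L{3,}|V{3,}|I{3,}|F{3,}|Y{3,}|A{3,})[KRH]/;

for my $seq (qw(KLLLH KAAAAAAR KLVIH KLLVVVH)) {
    printf "%-9s %s\n", $seq, $seq =~ $same_residue ? 'match' : 'no match';
}
```

KLLLH and KAAAAAAR match (there is no upper limit in {3,}), but KLVIH
does not, since each branch of the alternation requires a run of one
single letter. KLLVVVH fails as well, because the run of V is not
directly preceded by K, R or H.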

> The next problem is one I understand but have no idea how to correct. The
> value I set to $dnakmotif is too restrictive for the actual searches I need
> to do. What I want is to search for a sequence/character string with any of
> K or R or H on either end, but _between them_ any combination of L, V, I, F,
> A, or Y is OK, in repeats or all individually so long as the minimum number
> is 3 and the max number (any combination of the characters) is _no more_ than
> 5. How do I make this character search be much less restrictive than I've
> started out with?

Hrm, so the middle part must be NO LESS than 3 and NO MORE than 5, and is
made up solely of L, V, I, F, Y, and A characters?

$dnakmotif = qr/[KRH][LVIFYA]{3,5}[KRH]/;

looks like it does the trick.
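
One thing to watch if you list every hit with /g: a flanking K, R or H
that ends one hit can also start the next, and a plain /g scan will skip
that second hit. Whether you care about such overlapping hits is an
assumption on my part. If you do, wrapping the pattern in a lookahead is
one way to get them; the sequence below is made up:

```perl
use strict;
use warnings;

my $rx  = qr/[KRH][LVIFYA]{3,5}[KRH]/;
my $seq = 'MKLLLKVVVHA';    # invented test sequence

# Plain /g scan: each hit consumes its characters, so the K that ends
# the first hit cannot start a second one.
my @plain;
while ($seq =~ /($rx)/g) {
    push @plain, $1;
}

# Lookahead scan: the match itself is zero-length, so every starting
# position is tried and overlapping hits are reported with their offset.
my @overlap;
while ($seq =~ /(?=($rx))/g) {
    push @overlap, [ pos($seq), $1 ];
}

print "plain:   @plain\n";
printf "overlap: offset %d %s\n", @$_ for @overlap;
```

On this sequence the plain scan reports only KLLLK, while the lookahead
scan also reports KVVVH starting at offset 5 (offsets count from 0).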

--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
http://japhy.perlmonk.org/ % have long ago been overpaid?
http://www.perlmonks.org/ % -- Meister Eckhart

Re: need help with regular expression

on 15.06.2005 01:34:17 by Praedor Atrebates

On Tuesday 14 June 2005 17:34, Jeff 'japhy' Pinyan wrote:
> On Jun 14, Praedor Atrebates said:
> > $dnakmotif = '[KRH][L{3,}V{3,}I{3,}F{3,}Y{3,}A{3,}][KRH]';
> >
> > I am just learning as I go here but there are two problems with this, one
> > of which I understand but the other I do not. First, the one I do not
> > understand. My intent with the value set to $dnakmotif was to search for
> > K or R or H in a sequence followed by a string of 3 or more of any of the
> > contents of the second bracket pair (L V I F Y A). When I run the
> > program and run a search for "dnak" I get a string of hits in my test
> > sequence, but they don't match what I am after. I get a series of hits, for
> > instance, of a K followed by ONE A or ONE V followed by an H instead of
> > at LEAST 3 of any of L or V or I, etc. Why doesn't this work?
>
> The [...] construct is a character class -- it represents a set of
> characters, any of which can match. Thus, [KRH] matches a 'K', an 'R', or
> an 'H'. But [A{3,}B{3,}] is really just the same as [AB3,{}] -- that is,
> an 'A', a 'B', a '3', a ',', a '{', or a '}'. What you want is
>
> $dnakmotif = qr/[KRH](?:L{3,}|V{3,}|I{3,}|F{3,}|Y{3,}|A{3,})[KRH]/;
>
> That sounds like it should match what you're looking for.
>
> > The next problem is one I understand but have no idea how to correct.
> > The value I set to $dnakmotif is too restrictive for the actual searches
> > I need to do. What I want is to search for a sequence/character string
[...]

I may add one or two more amino acids to the middle portion as possibles, but
for now that is the rule I am trying to get working. In what you offer, what
does the leading "qr/" mean, and what of the "?:L..."? None of the internal
letters MUST repeat, but they CAN. The middle could be 3, 4, or 5 different
characters from the list with no repeats, one character repeated 3 to 5
times, or any combination of repeats and singles.

praedor

Re: need help with regular expression

on 15.06.2005 01:55:16 by japhy

On Jun 14, Praedor Atrebates said:

> On Tuesday 14 June 2005 17:34, Jeff 'japhy' Pinyan wrote:
>
>> The [...] construct is a character class -- it represents a set of
>> characters, any of which can match. Thus, [KRH] matches a 'K', an 'R', or
>> an 'H'. But [A{3,}B{3,}] is really just the same as [AB3,{}] -- that is,
>> an 'A', a 'B', a '3', a ',', a '{', or a '}'. What you want is
>>
>> $dnakmotif = qr/[KRH](?:L{3,}|V{3,}|I{3,}|F{3,}|Y{3,}|A{3,})[KRH]/;
>>
>> That sounds like it should match what you're looking for.
>
> I may add one or two more amino acids to the middle portion as possibles but
> for now that is the rule I am trying to get working. In what you offer, what
> does the leading "qr/" mean and what of the "?:L..."? None of the internal
> letters MUST repeat but they CAN. There could be no repeats with position
> filled by a unique 3, 4, or 5 characters from the list or it could be
> entirely one character repeated anywhere between 3 to 5 times to a
> combination of repeats and singles.

The qr/.../ construct *creates* a compiled regex, that you can then use
later. The inside of it is parsed like a regex (not like a normal quoted
string).

my $rx = qr/[KRH][LVIFAY]{3,5}[KRH]/;

if ($str =~ /$rx/) {
# do something
}
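
A slightly fuller sketch of the same idea, with a made-up test string:
the compiled regex can be used on its own or dropped into a larger
pattern.

```perl
use strict;
use warnings;

my $rx  = qr/[KRH][LVIFAY]{3,5}[KRH]/;
my $str = 'MKLVIAH';    # invented test string

# Used directly on the right-hand side of =~ ...
print "direct match\n" if $str =~ $rx;

# ... or interpolated into a bigger pattern, here anchored at both ends.
print "anchored match\n" if $str =~ /^M$rx\z/;

# qr// returns a reference blessed into the Regexp class.
print ref($rx), "\n";
```

Compiling once with qr// and reusing $rx also means you can keep several
such patterns in a hash or array and loop over them.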

The (?:...) part of the regex quoted above is a grouping construct that
does not capture to a $DIGIT variable.

/abc(def|ghi)jkl/

captures 'def' or 'ghi' (whichever matched) to $1, but

/abc(?:def|ghi)jkl/

does not capture anything.
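
A small sketch to show the difference in practice (the variable names
are just for illustration). In list context a match returns its
captures, so you can see that the (?:...) group does not take up a
number:

```perl
use strict;
use warnings;

my $str = 'abcdefjkl';

# Capturing group: the alternation that matched comes back as $1.
my ($inner) = $str =~ /abc(def|ghi)jkl/;

# Non-capturing group in the middle: only the two outer groups are
# numbered, so 'jkl' is $2 rather than $3.
my @numbered = $str =~ /(abc)(?:def|ghi)(jkl)/;

print "inner:    $inner\n";
print "numbered: @numbered\n";
```

Here $inner holds 'def', and @numbered holds 'abc' and 'jkl' only.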

I'd say you want to go ahead and use

qr/[KRH][LVIFAY]{3,5}[KRH]/;

for now, until you can come up with a more complex definition of the
interior of your sequence.
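
Since you mentioned possibly adding one or two more amino acids, one
option is to build the pattern from a list so the interior is easy to
change. This is only a sketch: build_motif and its arguments are names I
made up, and adding M (methionine) below is purely an example.

```perl
use strict;
use warnings;

# Hypothetical helper: build the motif from a minimum length, a maximum
# length, and a list of allowed interior residues.
sub build_motif {
    my ($min, $max, @residues) = @_;
    # quotemeta keeps any non-letter characters literal inside the class.
    my $class   = quotemeta join '', @residues;
    my $pattern = sprintf '[KRH][%s]{%d,%d}[KRH]', $class, $min, $max;
    return qr/$pattern/;
}

my $current  = build_motif(3, 5, qw(L V I F A Y));
my $extended = build_motif(3, 5, qw(L V I F A Y M));

for my $seq (qw(KLVIH KLMVH)) {
    printf "%-6s current=%-3s extended=%s\n",
        $seq,
        ($seq =~ $current  ? 'yes' : 'no'),
        ($seq =~ $extended ? 'yes' : 'no');
}
```

With the current list KLMVH is not a hit, but it becomes one as soon as
M is added to the interior residues.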
